Optimizing Query Processing in Batch Streaming System

Authors

  • Peter Xiang Gao
  • Di Wang
Abstract

With the growing need to process “big data” in real time, modern stream processing systems must be able to operate at cloud scale. This poses challenges for building large-scale stream processing systems. First, processing tasks should be distributed to worker nodes efficiently and with little overhead. Second, streaming data processing should be highly available, even though failures are common in datacenters. In Spark Streaming [26], the DStream model was proposed to address these problems. DStream stands for discretized stream: data in the incoming stream is divided into small batches for processing. Compared with processing data at the granularity of a single record, batch processing has much lower overhead and a cheaper fault tolerance model. Lineage information for each batch is kept so that the batch can be recomputed when a failure occurs. Fault tolerance can therefore be achieved without duplicating processing nodes. In this paper, we discuss how to optimize query processing in the DStream model. Specifically, we consider the case of Structured Query Language (SQL). SQL provides a declarative interface for users to query the data. The declarative nature of SQL creates opportunities for query optimization, because the execution is decoupled from the semantics of the query. In a streaming system, the same query is executed on similar data over and over again. Hence, statistics about the data can be obtained essentially for free, as long as the incoming data pattern does not change abruptly. We study the performance of applying query optimization techniques in the DStream model and show the advantage of dynamically optimizing stream processing.
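The idea of reusing statistics from earlier batches can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Spark; the names `run_batch` and `process_stream` are hypothetical, and the "query" is assumed to be a simple conjunction of filter predicates. Each batch is filtered in the current predicate order, the observed pass rates are recorded as a by-product, and the next batch evaluates the most selective predicate first.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, int]
Predicate = Callable[[Record], bool]


def run_batch(
    batch: List[Record], predicates: List[Predicate], order: List[int]
) -> Tuple[List[Record], Dict[int, float]]:
    """Filter one batch, evaluating predicates in `order` with short-circuiting.

    Returns the surviving records and the pass rate observed for each predicate.
    The rates are measured only on records that reached the predicate, so they
    are rough estimates rather than exact selectivities.
    """
    seen = [0] * len(predicates)
    passed = [0] * len(predicates)
    survivors: List[Record] = []
    for record in batch:
        keep = True
        for i in order:
            seen[i] += 1
            if predicates[i](record):
                passed[i] += 1
            else:
                keep = False
                break  # remaining predicates are skipped for this record
        if keep:
            survivors.append(record)
    rates = {i: passed[i] / seen[i] for i in order if seen[i] > 0}
    return survivors, rates


def process_stream(
    batches: List[List[Record]], predicates: List[Predicate]
) -> Tuple[List[List[Record]], List[List[int]]]:
    """Process batches in sequence, re-planning the predicate order after each one."""
    order = list(range(len(predicates)))
    results: List[List[Record]] = []
    history: List[List[int]] = []  # order used for each batch, for inspection
    for batch in batches:
        survivors, rates = run_batch(batch, predicates, order)
        results.append(survivors)
        history.append(list(order))
        # Lowest observed pass rate first; predicates with no data default to 1.0.
        order = sorted(order, key=lambda i: rates.get(i, 1.0))
    return results, history
```

With two predicates where the second is far more selective, the first batch runs them in the given order and the second batch runs the selective one first. The query result is unchanged; only the amount of work per record differs.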

Similar resources

Two Architectures for Parallel Processing of Huge Amounts of Text

This paper presents two alternative NLP architectures to analyze massive amounts of documents, using parallel processing. The two architectures focus on different processing scenarios, namely batch-processing and streaming processing. The batch-processing scenario aims at optimizing the overall throughput of the system, i.e., minimizing the overall time spent on processing all documents. The st...

Shared Query Processing in Data Streaming Systems

Shared Query Processing in Data Streaming Systems by Saileshwar Krishnamurthy Doctor of Philosophy in Computer Science University of California, Berkeley Professor Michael J. Franklin, Chair In networked environments there is an increased proliferation of sources (e.g., seismic sensors, financial tickers) that produce live data streams. As a consequence, systems that can manage streaming data h...

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in dataset that is currently being produced at high speed, large volumes and with a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing the huge amount of data, today has become one of the concerns of data science scholar...

Optimizing Latency and Throughput Trade-offs in a Stream Processing System

The value of stream processing systems stems largely from the timeliness of the results these systems provide. Early stream processors followed the record-at-a-time approach, servicing each data point as soon as it arrives at the system. While these systems provide good latency, their behaviors become less desirable when applications require high throughput, fault tolerance, or usage of statefu...

Customer Order Scheduling with Job-Based Processing and Lot Streaming In A Two-Machine Flow Shop

This paper considers a customer order scheduling (COS) problem in which each customer requests a variety of products processed in a two-machine flow shop. A sequence-independent attached setup for each machine is needed before processing each product lot. We assume that customer orders are satisfied by the job-based processing approach in which the same products from different customer orders f...


Publication date: 2013